Compiler Supported Interval Optimisation for Communication Induced Checkpointing

Authors

  • Jörg Preißinger
  • Mark Pflüger
Abstract

There are three main approaches to checkpoint-based recovery in distributed systems: coordinated checkpointing, uncoordinated checkpointing and communication induced checkpointing. It can be shown that communication induced checkpointing theoretically has the lowest minimum overhead, but also that the effective overhead depends on the communication behaviour and the resulting forced checkpoints. If the placement of checkpoints and the communication pattern are disadvantageous, the overhead can become arbitrarily large due to a high number of forced checkpoints. We introduce a compiler supported approach to avoid unfavourable combinations of communication behaviour and local checkpoint placement. We analyse the application statically and prepare the placement of voluntary checkpoints. These placement decisions are reviewed at runtime. With this approach we optimise the effective checkpoint intervals of voluntary and forced checkpoints and thus reduce the overhead of communication induced checkpointing.
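To illustrate how checkpoint placement and communication pattern interact, the following is a minimal sketch of an index-based communication induced checkpointing rule. It is not the authors' method: the abstract does not name a specific protocol, so the rule used here (piggyback the local checkpoint index on every message, and take a forced checkpoint when a received index is larger than the local one) is an assumption, and the names `Process`, `run`, `unfavourable` and `favourable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process under an assumed index-based forced-checkpoint rule."""
    index: int = 0       # index of the latest local checkpoint
    voluntary: int = 0   # checkpoints taken on the process's own schedule
    forced: int = 0      # checkpoints induced by incoming messages

    def voluntary_checkpoint(self) -> None:
        self.index += 1
        self.voluntary += 1

    def send(self) -> int:
        # The current checkpoint index is piggybacked on every message.
        return self.index

    def receive(self, piggybacked_index: int) -> None:
        # Assumed rule: a message from a "newer" checkpoint interval
        # forces a checkpoint before the message is delivered.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.forced += 1


def run(schedule):
    """Replay events ("ckpt", p) or ("msg", sender, receiver)."""
    procs = {}
    for event in schedule:
        if event[0] == "ckpt":
            procs.setdefault(event[1], Process()).voluntary_checkpoint()
        else:
            sender = procs.setdefault(event[1], Process())
            receiver = procs.setdefault(event[2], Process())
            receiver.receive(sender.send())
    return procs


# Unfavourable placement: p checkpoints right before every send,
# q never checkpoints on its own, so every message forces one at q.
unfavourable = [("ckpt", "p"), ("msg", "p", "q")] * 3

# Favourable placement: both processes checkpoint in step,
# so no received index ever exceeds the local one.
favourable = [("ckpt", "p"), ("ckpt", "q"), ("msg", "p", "q")] * 3

if __name__ == "__main__":
    for name, schedule in (("unfavourable", unfavourable),
                           ("favourable", favourable)):
        q = run(schedule)["q"]
        print(name, "forced at q:", q.forced, "voluntary at q:", q.voluntary)
```

In this toy model the same three messages cause three forced checkpoints at `q` under the first schedule and none under the second, which is the kind of placement-dependent overhead the abstract describes.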

Similar resources

Compiler-assisted Full Checkpointing

This paper describes a compiler-based approach to checkpointing for process recovery. The implementation is transparent to both the programmer and the hardware. The compiler-generated sparse potential checkpoint code maintains the desired checkpoint interval. Adaptive checkpointing reduces the size of the checkpoints. Training is used to select low-cost, high-coverage potential checkpoints. The...

Protocol for Coordinated Checkpointing using Smart Interval with Dual Coordinator

Introduction to Distributed System Design, Google Code University, http://code.google.com/edu/parallel/dsd-tutorial.html#Basics. D. Manivannan, R. H. B. Netzer & M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation", IEEE Trans. on Parallel & Distributed Systems, Vol. 8, No. 6, pp. 623-627 (June 1997). J. Tsai & S. Kuo, "Theoretical Analysis for Commun...

Type-Safe Object Exchange Between Applications and a DSM Kernel

The Plurix project implements an object-oriented Operating System (OS) for PC clusters. Communication is achieved via shared objects in a Distributed Shared Memory (DSM) using restartable transactions and an optimistic synchronization scheme to guarantee memory consistency. We contend that coupling object orientation with the DSM property allows a type-consistent system bootstrapping, quick sys...

Adjoints for Time-Dependent Optimal Control

The use of discrete adjoints in the context of a hard time-dependent optimal control problem is considered. Gradients required for the steepest descent method are computed by code that is generated automatically by the differentiation-enabled NAGWare Fortran compiler. Single time steps are taped using an overloading approach. The entire evolution is reversed based on an efficient checkpointing ...

A Performability Model for Applications using Checkpointing

An analytical model is used to investigate the effects of checkpointing on the performance and availability of sequential and parallel applications. Known as Steady-State Performability (SSP), this model provides a probabilistic method for quantifying delivered performance considering failure and recovery. Input parameters describe both the distributed application and the processing environment...

Publication date: 2007